Two studies were conducted:
Question addressed here:
Can we use this body of patient information to predict infection?
This work is an exploratory “proof of concept” to see whether there is something here worth more investigation.
Infection variables (e.g., blood, urine) refer to the type of infection; bloodstream infections are of primary interest. For the TT study I started the EDA with the most general “any” infection variable, and in some cases looked specifically at “blood” infection.
I looked at the day before onset, and in most cases restricted further to a patient’s first onset.
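As a minimal sketch of that restriction, the code below flags the row that falls one day before each patient's first onset. The data frame `daily` and its columns `id`, `day`, and `onset` are hypothetical stand-ins, not the variable names used in the study data.

```r
# Hypothetical toy data: one row per patient-day; `onset` marks a new infection onset.
daily <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 2),
  day   = c(1, 2, 3, 4, 1, 2, 3),
  onset = c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

# Day of each patient's first onset (NA if the patient never has one).
first_onset_day <- vapply(
  split(daily, daily$id),
  function(p) if (any(p$onset)) min(p$day[p$onset]) else NA_real_,
  numeric(1)
)

# TRUE only on the day immediately before the patient's first onset.
is_day_before <- unname(daily$day == first_onset_day[as.character(daily$id)] - 1)
daily$day_before_first_onset <- !is.na(is_day_before) & is_day_before
```

Patient 1's second onset day (day 4) does not generate a second "day before" flag, which is the point of restricting to the first onset.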
Sepsis is identified differently for burn patients than for others. Individual variables such as temperature, which are normally used as indicators, are generally elevated or otherwise abnormal due to the burn.
Important notes from the study authors, Dr. Palmieri and Dr. Tran via Sandy Taylor:
Temperature was used as a criterion for culture in both studies. This means that the strength of temperature as an indicator or pre-indicator of infection is biased, since it was used to select who was screened. This may be true of other measurements. From Dr. Tran’s email: Cultures for PCR Sepsis were collected only when indicated at each site. Typically, that would be based on the presence of signs of sepsis such as fever (Temp >39.5C). For PCR Sepsis, we kept it a bit more simple especially for blood cultures. If the patient had a truly pathogenic organism from blood culture, then they were bacteremic, and considered septic if they met other criteria (I believe we used the same “Infection” form as Transfusion Trigger) such as fever, WBC, platelets, etc.
From Dr. Palmieri’s email: In terms of labs to focus on, would use platelets, wbc, sodium, chloride.
From Dr. Tran’s email: Routinely collected labs that may have value in predicting sepsis (more so burn sepsis) would be the usual CBC. Platelet count useful for sepsis severity and somewhat of a late marker for burn sepsis. We’re messing with some AI and machine learning work right now with the PCR Sepsis database and platelets remain a strong parameter to look at. Not sure if TRIBE has CBC indices (RDW, MCHC, MCV, etc) which may be of use. Electrolytes including sodium variability may be helpful from the chemistry panel. Respiratory rate, heart rate, etc are also useful. Pretty much the parameters from Dr. Greenhalgh’s consensus guidelines (JBCR 2007?).
There are three variables indicating time of blood-level screening in the data, in addition to the vital screening variable (V_TIME_PERFORMED). They refer to blood CBC (V_TIME_PERFORMED_1), blood chemistry (V_TIME_PERFORMED_2), and blood gas (V_TIME_PERFORMED_3) respectively, according to the variable descriptions.
Of the 14852 entries, almost all have vitals (V_TIME_PERFORMED, 33 NA), and most have blood CBC (V_TIME_PERFORMED_1, 2186 NA) and blood chemistry (V_TIME_PERFORMED_2, 3542 NA). Only about 39% have blood gas (V_TIME_PERFORMED_3, 9111 NA).
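A sketch of how these NA counts can be tallied, using a small made-up data frame (`TT_toy` is hypothetical; only the column names are taken from the text). The last line redoes the blood gas share from the counts quoted above.

```r
# Hypothetical toy data using the timing variable names from the text.
TT_toy <- data.frame(
  V_TIME_PERFORMED   = c("08:00", "08:10", NA, "07:55"),
  V_TIME_PERFORMED_1 = c("08:30", NA, NA, "08:05"),
  V_TIME_PERFORMED_2 = c(NA, NA, "09:00", "08:05"),
  V_TIME_PERFORMED_3 = c(NA, NA, NA, "08:20")
)

# Count missing entries per timing variable and convert to the share present.
time_vars     <- grep("^V_TIME_PERFORMED", names(TT_toy), value = TRUE)
na_counts     <- colSums(is.na(TT_toy[time_vars]))
share_present <- 1 - na_counts / nrow(TT_toy)

# Share of entries with a blood gas time, from the counts reported in the text.
blood_gas_share <- (14852 - 9111) / 14852
```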
Only two entries have a blood infection reported when neither blood CBC nor blood chemistry was performed. For the other 120, at least one of the two was performed, so these panels seem to be a precursor for detection.
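The comparison above amounts to a two-way table of "any blood panel run" against "blood infection reported". A minimal sketch on made-up data follows; the data frame `toy` and the column `blood_infection` are hypothetical names.

```r
# Hypothetical toy data: one row per patient-day.
toy <- data.frame(
  blood_infection    = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  V_TIME_PERFORMED_1 = c("08:30", NA, NA, "08:05", "09:10", NA),  # blood CBC
  V_TIME_PERFORMED_2 = c("08:30", NA, "09:00", NA, NA, NA)        # blood chemistry
)

# TRUE when at least one of the two blood panels was run that day.
toy$any_blood_panel <- !is.na(toy$V_TIME_PERFORMED_1) | !is.na(toy$V_TIME_PERFORMED_2)

# Rows: was a panel run? Columns: was a blood infection reported?
panel_by_infection <- table(
  blood_panel     = toy$any_blood_panel,
  blood_infection = toy$blood_infection
)
panel_by_infection
```

The cell for "no panel, infection reported" corresponds to the two unusual entries mentioned in the text.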
Observations:
#Correlation of having such an infection
round(cor(TT_per[,c("n_blood","n_urine","n_wound", "n_pneumonia", "n_any")] > 0, use = "pairwise.complete.obs"), 2)
##             n_blood n_urine n_wound n_pneumonia n_any
## n_blood        1.00    0.30    0.21        0.29  0.55
## n_urine        0.30    1.00    0.18        0.25  0.44
## n_wound        0.21    0.18    1.00        0.17  0.39
## n_pneumonia    0.29    0.25    0.17        1.00  0.68
## n_any          0.55    0.44    0.39        0.68  1.00
#Correlation of number of such infections
round(cor(TT_per[,c("n_blood","n_urine","n_wound", "n_pneumonia", "n_any")], use = "pairwise.complete.obs"), 2)
##             n_blood n_urine n_wound n_pneumonia n_any
## n_blood        1.00    0.42    0.45        0.43  0.72
## n_urine        0.42    1.00    0.39        0.52  0.69
## n_wound        0.45    0.39    1.00        0.39  0.70
## n_pneumonia    0.43    0.52    0.39        1.00  0.86
## n_any          0.72    0.69    0.70        0.86  1.00
Observations:
Observations
#Days from first collection to first infection
table(TT_per$first_any)
##
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  15  16  17  18
##   5  10  10  11   7  11   9  12   4   8   6   6   6   6   6   2   6   4
##  19  20  22  23  24  25  26  29  30  31  33  34  43  46  52  85 118
##   2   1   2   1   1   2   3   1   2   2   1   1   4   1   2   1   1
#Days from first collection to first blood infection
table(TT_per$first_blood)
##
##   1   2   3   4   5   6   7   8   9  10  11  12  15  16  17  19  20  21
##   2   5   1   1   3   2   5   1   5   2   2   3   4   1   3   1   1   2
##  24  25  26  27  30  33  38  39  40  43  50  52  56  70  73  85  86 117
##   1   1   3   1   2   1   2   2   1   2   1   1   1   1   1   1   1   1
## 118 363
##   1   1
days_from_admit_to_collection = difftime(TT_per$first_collection_date, TT_per$Admit_date, units = "days")
#Days from admit to first collection
table(days_from_admit_to_collection)
## days_from_admit_to_collection
##   0   1   2   3   4
## 225 104   9   6   2
#Days from admit to first infection
table(TT_per$first_any + days_from_admit_to_collection)
##
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  15  16  17  18
##   3   5  10   9  12  14   8   9   9   6   8   4   7   7   3   4   6   4
##  19  20  21  22  23  24  26  27  30  31  33  38  43  44  46  52  54  85
##   1   1   1   1   2   2   4   1   2   3   1   1   3   1   1   1   1   1
## 118
##   1
#Days from admit to first blood infection
table(TT_per$first_blood + days_from_admit_to_collection)
##
##   2   3   4   5   6   7   8   9  10  11  12  13  15  16  17  18  19  20
##   3   4   2   1   4   3   3   4   2   1   4   1   1   3   3   1   1   1
##  21  24  26  27  30  31  33  38  39  40  41  43  50  52  56  70  76  85
##   2   1   4   1   1   1   1   1   1   2   1   2   1   1   1   1   1   1
##  86 117 118 363
##   1   1   1   1
The charts below are for data in aggregate. Individual-level data may be more telling.
The red vertical line is the first blood infection and the blue is any infection.
Distributions of vital statistics by infection status excluding outliers.
Based on the figure below, we’d suspect that some variables may be correlated with infection or pre-infection, including heart rate, temperature, platelet count, and sodium.
There are a few hundred NA entries (out of about 15,000) in “onset_tomorrow”. Most arise because an individual has no next-day reading, which means that the data are not missing at random. Some are also due to NA entries in the onset variable.
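To make the two sources of NA concrete, here is a minimal sketch of one way a next-day onset variable could be built. The data frame `daily` and its columns are hypothetical, and this is not necessarily how “onset_tomorrow” was constructed in the analysis.

```r
# Hypothetical toy data: patient 1 has no reading on day 3;
# patient 2 has an NA onset entry on day 3.
daily <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  day   = c(1, 2, 4, 1, 2, 3),
  onset = c(FALSE, FALSE, TRUE, FALSE, TRUE, NA)
)

# Look up each row's next-day onset by matching (id, day + 1) against (id, day).
key_today    <- paste(daily$id, daily$day)
key_tomorrow <- paste(daily$id, daily$day + 1)
daily$onset_tomorrow <- daily$onset[match(key_tomorrow, key_today)]
daily
```

Rows with no matching next-day row get NA (for example patient 1 on day 2), and so does a row whose next-day onset entry is itself NA (patient 2 on day 2).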
Distributions of vital statistics by infection status excluding outliers.
Investigation of ICU days and PDR (predicted death rate) shows that those with current or imminent infection are generally in worse health than the other groups. This is important for recognizing that factors like elevated heart rate may be due to general poor health rather than impending infection.
Next we look at the observed patterns more formally with a multinomial model whose possible outcomes are as labelled in the figure: current infection, day before infection onset, and neither of those. The NA case is excluded. I used a penalized version of multinomial regression from glmnet, with cross-validation, to select variables. The tables below show the fitted non-zero coefficients using a less restrictive and a more restrictive penalty term. The input data were standardized before fitting the model so that the coefficient magnitudes would be comparable, though they lose interpretability as a result.
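The general shape of such a fit can be sketched as follows. This runs on simulated data with made-up predictor names (`V_SIM_1` to `V_SIM_5`), and `nonzero_coefs` is a helper written for this illustration; it is not the code used to produce the tables, and it assumes the glmnet package is installed.

```r
library(glmnet)  # assumes the glmnet package is installed

set.seed(1)
# Simulated stand-in data: 300 patient-days, 5 made-up predictors.
n <- 300
x <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("V_SIM_", 1:5)))
# The outcome depends on the first predictor only; class labels mirror the text.
score <- x[, 1] + rnorm(n)
y <- cut(score, breaks = c(-Inf, -0.5, 0.5, Inf),
         labels = c("Neither", "Day Before Infection", "Current Infection"))

# Standardize up front so that coefficient magnitudes are comparable.
x_std <- scale(x)
cvfit <- cv.glmnet(x_std, y, family = "multinomial", standardize = FALSE)

# Bind the per-class coefficient vectors into one matrix and keep non-zero rows.
nonzero_coefs <- function(fit, s) {
  per_class <- coef(fit, s = s)
  m <- do.call(cbind, lapply(per_class, as.matrix))
  colnames(m) <- names(per_class)
  round(m[rowSums(m != 0) > 0, , drop = FALSE], 4)
}

nonzero_coefs(cvfit, s = "lambda.min")            # less restrictive penalty
nonzero_coefs(cvfit, s = cvfit$lambda.1se * 0.8)  # penalty scaled from lambda.1se
```

With the default `type.multinomial = "ungrouped"`, a variable can be non-zero for one class and zero for another; setting `type.multinomial = "grouped"` makes each variable enter or leave for all classes together.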
## [1] "1. Model with all data"
## [1] "Coefficients -- smaller penalty"
##                       Current infection Day Before Infection Neither
## (Intercept)                     -8.0550              -0.6922  8.7472
## V_FIO2                           0.0205               0.0252 -0.0457
## V_HEART_RATE                     0.0519               0.0401 -0.0919
## V_GLUCOSE                        0.0525               0.2889 -0.3414
## V_BLOOD_UREA_NITROGEN            0.0616               0.0111 -0.0727
## V_WHITE_BC                       0.0704               0.1645 -0.2349
## V_RESPIRATORY_RATE               0.1183               0.0194 -0.1378
## V_MODS_SCORE                     0.2207               0.0299 -0.2506
## V_SODIUM                         0.6340               0.5607 -1.1948
## V_TEMPERATURE                    6.0934              -1.3635 -4.7299
## [1] "Coefficients - larger penalty (lambda.1se*.8)"
##               Current Infection Day Before Infection Neither
## (Intercept)             -0.9785              -1.0297  2.0082
## V_GLUCOSE                0.0066               0.0147 -0.0213
## V_WHITE_BC               0.0259               0.0430 -0.0689
## V_MODS_SCORE             0.0622               0.0249 -0.0871
## V_TEMPERATURE            0.1207              -0.0047 -0.1160
## [1] "2. Model with data to first infection and infection days"
## [1] "Coefficients -- smaller penalty"
##                         Current Infection Day Before Infection Neither
## (Intercept)                       -3.4271              -1.8108  5.2379
## V_HEMOGLOBIN                      -1.3027               0.3129  0.9898
## V_PAO2                            -0.4164              -0.1168  0.5333
## V_MEANARTERIAL_PRESSURE           -0.4117               0.1465  0.2652
## V_PB_SYSTOLIC                     -0.2600               0.1547  0.1053
## V_CREATININE                      -0.2324              -0.1867  0.4190
## V_GLUCOSE                          0.0006               0.0022 -0.0028
## V_FIO2                             0.0168               0.0003 -0.0171
## V_MODS_SCORE                       0.0525               0.0416 -0.0941
## V_PLATELET_COUNT                   0.0879              -0.0291 -0.0588
## V_PACO2                            0.1347               0.0171 -0.1518
## V_RESPIRATORY_RATE                 0.2895               0.0348 -0.3242
## V_WHITE_BC                         0.3048              -0.0089 -0.2959
## V_POTASSIUM                        0.4361               0.0519 -0.4880
## V_BLOOD_UREA_NITROGEN              0.4417              -0.1387 -0.3030
## V_HEART_RATE                       0.4529               0.1509 -0.6038
## V_GLASCOWCOMA_SCALE                0.8093              -0.6870 -0.1223
## V_SODIUM                           1.0240               1.4221 -2.4461
## V_TEMPERATURE                      1.9304              -0.6288 -1.3016
## [1] "Coefficients - larger penalty (lambda.1se*.5)"
##                         Current Infection Day Before Infection Neither
## (Intercept)                       -2.3480              -1.7468  4.0947
## V_HEMOGLOBIN                      -1.0116               0.2084  0.8032
## V_PAO2                            -0.3908              -0.1064  0.4972
## V_MEANARTERIAL_PRESSURE           -0.2955               0.0970  0.1985
## V_CREATININE                      -0.1283              -0.1007  0.2290
## V_PB_SYSTOLIC                     -0.0728               0.0341  0.0386
## V_PACO2                            0.0445               0.0068 -0.0513
## V_PLATELET_COUNT                   0.0678              -0.0187 -0.0491
## V_WHITE_BC                         0.2374              -0.0031 -0.2342
## V_RESPIRATORY_RATE                 0.2387               0.0210 -0.2597
## V_POTASSIUM                        0.2761               0.0286 -0.3047
## V_HEART_RATE                       0.2981               0.0793 -0.3774
## V_BLOOD_UREA_NITROGEN              0.3128              -0.0855 -0.2272
## V_GLASCOWCOMA_SCALE                0.4977              -0.3865 -0.1112
## V_SODIUM                           0.9180               1.0231 -1.9411
## V_TEMPERATURE                      1.3008              -0.3017 -0.9991
## [1] "3. Model with data to first infection"
## [1] "Coefficients -- smaller penalty (lambda.min*.5)"
##                           Day Before Infection      Neither
## (Intercept)                       -1.579252814  1.579252814
## V_PACO2                           -0.161176042  0.161176042
## V_HCO3                            -0.020185000  0.020185000
## V_POTASSIUM                        0.004014613 -0.004014613
## V_FIO2                             0.046985008 -0.046985008
## V_CENTRAL_VENOUS_PRESSURE          0.080083762 -0.080083762
## V_GLUCOSE                          0.097333418 -0.097333418
## V_SODIUM                           0.339327711 -0.339327711
## [1] "Coefficients - larger penalty"
## Day Before Infection              Neither
##            -1.230005             1.230005
Not surprisingly, temperature has by far the largest coefficient values. Second is the MODS score (Multiple Organ Dysfunction Score), and I’m not sure what that means or if it makes sense in context. As expected by the study authors, respiratory rate, white blood cell count, and heart rate are somewhat correlated with the outcomes. In the more limited coefficient set, only the Glasgow Coma Scale is a negative predictor, i.e., the higher the score, the less likely an infection outcome. Also, all coefficients are stronger for current infection than for day before infection, with opposite sign for no infection. The same is not true of the larger set.
Another notable outcome is that the two variables included to control for severity of condition appear in the larger model, but with moderate coefficient values, and are absent from the smaller model.
The figure compares the distribution of sepsis days per patient in the PCR study to the distribution of “onset” days per patient in the TT study. (According to the project readme, the variable “SEPSIS_STATUS” in the PCR data indicates whether the patient was determined to have a new onset of sepsis at that time point.) The distributions are similar overall, but more patients in the PCR study have very high numbers of sepsis days.
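A per-patient count of sepsis days can be sketched as below. The data frame `pcr_toy`, the `id` column, and the "Yes"/"No" coding of SEPSIS_STATUS are assumptions made for illustration only.

```r
# Hypothetical toy data: one row per patient-day.
pcr_toy <- data.frame(
  id            = c("A", "A", "A", "B", "B", "C"),
  SEPSIS_STATUS = c("Yes", "No", "Yes", "No", "No", "Yes")
)

# Number of sepsis days for each patient, then the distribution of those counts.
sepsis_days <- tapply(pcr_toy$SEPSIS_STATUS == "Yes", pcr_toy$id, sum)
table(sepsis_days)
```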
I will simplify this section by only comparing sepsis days to non-sepsis days, and I will see if I get a similar list of variables associated with infection as in the TT study.
Distribution of vitals for sepsis vs. non-sepsis patients evaluated daily, excluding outliers
The figure compares the distribution of vital statistics for sepsis and non-sepsis days in patient history after removing “outliers”, i.e., the bottom and top two percent of each distribution. White blood cell count seems to behave similarly to the TT study, but heart rate and platelet count seem to have the opposite association. For example, average heart rate seems to be lower for the sepsis group.
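The trimming rule described above (drop the bottom and top two percent) can be sketched as a small helper. `trim_tails` is a name made up for this illustration, and the heart rate values are simulated.

```r
# Keep only values between the p-th and (1 - p)-th quantiles of a numeric vector.
trim_tails <- function(x, p = 0.02) {
  lims <- quantile(x, probs = c(p, 1 - p), na.rm = TRUE)
  x[!is.na(x) & x >= lims[1] & x <= lims[2]]
}

set.seed(1)
# Simulated heart rates plus two implausible readings.
heart_rate <- c(rnorm(200, mean = 100, sd = 15), 5, 400)
trimmed <- trim_tails(heart_rate)
range(trimmed)
```

Because the cutoffs are quantiles of each variable's own distribution, this removes a fixed share of readings whether or not they are clinically implausible.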
Below are the fitted coefficients from only the less restrictive cross-validated multinomial model. The more restrictive model is left out since this model is already fairly small.
Note that although temperature did not appear to be significant based on the distribution plots, it is again the most important term in the model. As above, coefficients are shown on a standardized scale.
sum(TT$Onset == "Yes" & TT$Any == "No", na.rm = TRUE)
## [1] 47
# PCR STUDY
##
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
##  101   11    8    1    4    5    3    7    8   12    8    6    2    2    5
##   16   17   18   19   20   21   22   23   24   25   27   28   30   34   35
##    2    3    2    3    2    3    1    1    1    1    1    2    1    1    2
##   36   37   40   43   48   52 <NA>
##    1    1    2    1    1    1    0
# TT Study
##
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   15
##    5   10   10   11    7   11    9   12    4    8    6    6    6    6    6
##   16   17   18   19   20   22   23   24   25   26   29   30   31   33   34
##    2    6    4    2    1    2    1    1    2    3    1    2    2    1    1
##   43   46   52   85  118 <NA>
##    4    1    2    1    1  189